<!DOCTYPE html>
<html lang="en">
<!-- Produced from a LaTeX source file.  Note that the production is done -->
<!-- by a very rough-and-ready (and buggy) script, so the HTML and other  -->
<!-- code is quite ugly!  Later versions should be better.                -->
<head>
    <meta charset="utf-8">
    <meta name="citation_title" content="Neural Networks and Deep Learning">
    <meta name="citation_author" content="Nielsen, Michael A.">
    <meta name="citation_publication_date" content="2015">
    <meta name="citation_fulltext_html_url" content="http://neuralnetworksanddeeplearning.com">
    <meta name="citation_publisher" content="Determination Press">
    <meta name="citation_fulltext_world_readable" content="">
    <link rel="icon" href="http://neuralnetworksanddeeplearning.com/nnadl_favicon.ICO" />
    <title>Neural networks and deep learning</title>
    <script src="assets/jquery.min.js"></script>
    <script type="text/x-mathjax-config">
      MathJax.Hub.Config({
        tex2jax: {inlineMath: [['$','$']]},
        "HTML-CSS": 
          {scale: 92},
        TeX: { equationNumbers: { autoNumber: "AMS" }}});
    </script>
    <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>


    <link href="assets/style.css" rel="stylesheet">
    <link href="assets/pygments.css" rel="stylesheet">
    <link rel="stylesheet" href="https://code.jquery.com/ui/1.11.2/themes/smoothness/jquery-ui.css">

<style>
/* Adapted from */
/* https://groups.google.com/d/msg/mathjax-users/jqQxrmeG48o/oAaivLgLN90J, */
/* by David Cervone */

@font-face {
    font-family: 'MJX_Math';
    src: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); /* IE9 Compat Modes */
    src: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot?iefix') format('eot'),
    url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff')  format('woff'),
    url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf')  format('opentype'),
    url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/svg/MathJax_Math-Italic.svg#MathJax_Math-Italic') format('svg');
}

@font-face {
    font-family: 'MJX_Main';
    src: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); /* IE9 Compat Modes */
    src: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot?iefix') format('eot'),
    url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff')  format('woff'),
    url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf')  format('opentype'),
    url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/svg/MathJax_Main-Regular.svg#MathJax_Main-Regular') format('svg');
}
</style>

  </head>
  <body><div class="header"><h1 class="chapter_number">
  <a href="chap2.html">CHAPTER 2</a></h1>
  <h1 class="chapter_title"><a href="chap2.html">How the backpropagation algorithm works</a></h1></div><div class="section"><div id="toc"> 
<p class="toc_title"><a href="index.html">Neural Networks and Deep Learning</a></p><p class="toc_not_mainchapter"><a href="about.html">What this book is about</a></p><p class="toc_not_mainchapter"><a href="exercises_and_problems.html">On the exercises and problems</a></p><p class='toc_mainchapter'><a id="toc_using_neural_nets_to_recognize_handwritten_digits_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_using_neural_nets_to_recognize_handwritten_digits" src="images/arrow.png" width="15px"></a><a href="chap1.html">Using neural nets to recognize handwritten digits</a><div id="toc_using_neural_nets_to_recognize_handwritten_digits" style="display: none;"><p class="toc_section"><ul><a href="chap1.html#perceptrons"><li>Perceptrons</li></a><a href="chap1.html#sigmoid_neurons"><li>Sigmoid neurons</li></a><a href="chap1.html#the_architecture_of_neural_networks"><li>The architecture of neural networks</li></a><a href="chap1.html#a_simple_network_to_classify_handwritten_digits"><li>A simple network to classify handwritten digits</li></a><a href="chap1.html#learning_with_gradient_descent"><li>Learning with gradient descent</li></a><a href="chap1.html#implementing_our_network_to_classify_digits"><li>Implementing our network to classify digits</li></a><a href="chap1.html#toward_deep_learning"><li>Toward deep learning</li></a></ul></p></div>
<script>
$('#toc_using_neural_nets_to_recognize_handwritten_digits_reveal').click(function() { 
   var src = $('#toc_img_using_neural_nets_to_recognize_handwritten_digits').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow.png');
   };
   $('#toc_using_neural_nets_to_recognize_handwritten_digits').toggle('fast', function() {});  
});</script><p class='toc_mainchapter'><a id="toc_how_the_backpropagation_algorithm_works_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_how_the_backpropagation_algorithm_works" src="images/arrow.png" width="15px"></a><a href="chap2.html">How the backpropagation algorithm works</a><div id="toc_how_the_backpropagation_algorithm_works" style="display: none;"><p class="toc_section"><ul><a href="chap2.html#warm_up_a_fast_matrix-based_approach_to_computing_the_output_from_a_neural_network"><li>Warm up: a fast matrix-based approach to computing the output  from a neural network</li></a><a href="chap2.html#the_two_assumptions_we_need_about_the_cost_function"><li>The two assumptions we need about the cost function</li></a><a href="chap2.html#the_hadamard_product_$s_\odot_t$"><li>The Hadamard product, $s \odot t$</li></a><a href="chap2.html#the_four_fundamental_equations_behind_backpropagation"><li>The four fundamental equations behind backpropagation</li></a><a href="chap2.html#proof_of_the_four_fundamental_equations_(optional)"><li>Proof of the four fundamental equations (optional)</li></a><a href="chap2.html#the_backpropagation_algorithm"><li>The backpropagation algorithm</li></a><a href="chap2.html#the_code_for_backpropagation"><li>The code for backpropagation</li></a><a href="chap2.html#in_what_sense_is_backpropagation_a_fast_algorithm"><li>In what sense is backpropagation a fast algorithm?</li></a><a href="chap2.html#backpropagation_the_big_picture"><li>Backpropagation: the big picture</li></a></ul></p></div>
<script>
$('#toc_how_the_backpropagation_algorithm_works_reveal').click(function() { 
   var src = $('#toc_img_how_the_backpropagation_algorithm_works').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow.png');
   };
   $('#toc_how_the_backpropagation_algorithm_works').toggle('fast', function() {});  
});</script><p class='toc_mainchapter'><a id="toc_improving_the_way_neural_networks_learn_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_improving_the_way_neural_networks_learn" src="images/arrow.png" width="15px"></a><a href="chap3.html">Improving the way neural networks learn</a><div id="toc_improving_the_way_neural_networks_learn" style="display: none;"><p class="toc_section"><ul><a href="chap3.html#the_cross-entropy_cost_function"><li>The cross-entropy cost function</li></a><a href="chap3.html#overfitting_and_regularization"><li>Overfitting and regularization</li></a><a href="chap3.html#weight_initialization"><li>Weight initialization</li></a><a href="chap3.html#handwriting_recognition_revisited_the_code"><li>Handwriting recognition revisited: the code</li></a><a href="chap3.html#how_to_choose_a_neural_network's_hyper-parameters"><li>How to choose a neural network's hyper-parameters?</li></a><a href="chap3.html#other_techniques"><li>Other techniques</li></a></ul></p></div>
<script>
$('#toc_improving_the_way_neural_networks_learn_reveal').click(function() { 
   var src = $('#toc_img_improving_the_way_neural_networks_learn').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow.png');
   };
   $('#toc_improving_the_way_neural_networks_learn').toggle('fast', function() {});  
});</script><p class='toc_mainchapter'><a id="toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_a_visual_proof_that_neural_nets_can_compute_any_function" src="images/arrow.png" width="15px"></a><a href="chap4.html">A visual proof that neural nets can compute any function</a><div id="toc_a_visual_proof_that_neural_nets_can_compute_any_function" style="display: none;"><p class="toc_section"><ul><a href="chap4.html#two_caveats"><li>Two caveats</li></a><a href="chap4.html#universality_with_one_input_and_one_output"><li>Universality with one input and one output</li></a><a href="chap4.html#many_input_variables"><li>Many input variables</li></a><a href="chap4.html#extension_beyond_sigmoid_neurons"><li>Extension beyond sigmoid neurons</li></a><a href="chap4.html#fixing_up_the_step_functions"><li>Fixing up the step functions</li></a><a href="chap4.html#conclusion"><li>Conclusion</li></a></ul></p></div>
<script>
$('#toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal').click(function() { 
   var src = $('#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow.png');
   };
   $('#toc_a_visual_proof_that_neural_nets_can_compute_any_function').toggle('fast', function() {});  
});</script><p class='toc_mainchapter'><a id="toc_why_are_deep_neural_networks_hard_to_train_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_why_are_deep_neural_networks_hard_to_train" src="images/arrow.png" width="15px"></a><a href="chap5.html">Why are deep neural networks hard to train?</a><div id="toc_why_are_deep_neural_networks_hard_to_train" style="display: none;"><p class="toc_section"><ul><a href="chap5.html#the_vanishing_gradient_problem"><li>The vanishing gradient problem</li></a><a href="chap5.html#what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets"><li>What's causing the vanishing gradient problem?  Unstable gradients in deep neural nets</li></a><a href="chap5.html#unstable_gradients_in_more_complex_networks"><li>Unstable gradients in more complex networks</li></a><a href="chap5.html#other_obstacles_to_deep_learning"><li>Other obstacles to deep learning</li></a></ul></p></div>
<script>
$('#toc_why_are_deep_neural_networks_hard_to_train_reveal').click(function() { 
   var src = $('#toc_img_why_are_deep_neural_networks_hard_to_train').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow.png');
   };
   $('#toc_why_are_deep_neural_networks_hard_to_train').toggle('fast', function() {});  
});</script><p class='toc_mainchapter'><a id="toc_deep_learning_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_deep_learning" src="images/arrow.png" width="15px"></a><a href="chap6.html">Deep learning</a><div id="toc_deep_learning" style="display: none;"><p class="toc_section"><ul><a href="chap6.html#introducing_convolutional_networks"><li>Introducing convolutional networks</li></a><a href="chap6.html#convolutional_neural_networks_in_practice"><li>Convolutional neural networks in practice</li></a><a href="chap6.html#the_code_for_our_convolutional_networks"><li>The code for our convolutional networks</li></a><a href="chap6.html#recent_progress_in_image_recognition"><li>Recent progress in image recognition</li></a><a href="chap6.html#other_approaches_to_deep_neural_nets"><li>Other approaches to deep neural nets</li></a><a href="chap6.html#on_the_future_of_neural_networks"><li>On the future of neural networks</li></a></ul></p></div>
<script>
$('#toc_deep_learning_reveal').click(function() { 
   var src = $('#toc_img_deep_learning').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_deep_learning").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_deep_learning").attr('src', 'images/arrow.png');
   };
   $('#toc_deep_learning').toggle('fast', function() {});  
});</script><p class="toc_not_mainchapter"><a href="sai.html">Appendix: Is there a <em>simple</em> algorithm for intelligence?</a></p><p class="toc_not_mainchapter"><a href="acknowledgements.html">Acknowledgements</a></p><p class="toc_not_mainchapter"><a href="faq.html">Frequently Asked Questions</a></p>
<hr>
<p class="sidebar"> If you benefit from the book, please make a small
donation.  I suggest $5, but you can choose the amount.</p>

<form action="https://www.paypal.com/cgi-bin/webscr" method="post" target="_top">
<input type="hidden" name="cmd" value="_s-xclick">
<input type="hidden" name="hosted_button_id" value="5K9YAHR4X84RN">
<input type="image" src="https://www.paypalobjects.com/en_US/i/btn/btn_donateCC_LG.gif" border="0" name="submit" alt="PayPal - The safer, easier way to pay online!">
<img alt="" border="0" src="https://www.paypalobjects.com/en_US/i/scr/pixel.gif" width="1" height="1">
</form>

<p class="sidebar">Alternately, you can make a donation by sending me
Bitcoin, at address <span style="font-size: 0.7em">1Kd6tXH5SDAmiFb49J9hknG5pqj7KStSAx</span></p>

<!--
<hr>

<p class="sidebar"> If you benefit from the book, please make a small
donation.  I suggest $3, but you can choose the amount.</p>

<form action="https://www.paypal.com/cgi-bin/webscr" method="post" target="_top">
<input type="hidden" name="cmd" value="_s-xclick">
<input type="hidden" name="encrypted" value="-----BEGIN PKCS7-----MIIHTwYJKoZIhvcNAQcEoIIHQDCCBzwCAQExggEwMIIBLAIBADCBlDCBjjELMAkGA1UEBhMCVVMxCzAJBgNVBAgTAkNBMRYwFAYDVQQHEw1Nb3VudGFpbiBWaWV3MRQwEgYDVQQKEwtQYXlQYWwgSW5jLjETMBEGA1UECxQKbGl2ZV9jZXJ0czERMA8GA1UEAxQIbGl2ZV9hcGkxHDAaBgkqhkiG9w0BCQEWDXJlQHBheXBhbC5jb20CAQAwDQYJKoZIhvcNAQEBBQAEgYAtusFIFTgWVpgZsMgI9zMrWRAFFKQqeFiE6ay1nbmP360YzPtR+vvCXwn214Az9+F9g7mFxe0L+m9zOCdjzgRROZdTu1oIuS78i0TTbcbD/Vs/U/f9xcmwsdX9KYlhimfsya0ydPQ2xvr4iSGbwfNemIPVRCTadp/Y4OQWWRFKGTELMAkGBSsOAwIaBQAwgcwGCSqGSIb3DQEHATAUBggqhkiG9w0DBwQIK5obVTaqzmyAgajgc4w5t7l6DjTGVI7k+4UyO3uafxPac23jOyBGmxSnVRPONB9I+/Q6OqpXZtn8JpTuzFmuIgkNUf1nldv/DA1mhPOeeVxeuSGL8KpWxpJboKZ0mEu9b+0FJXvZW+snv0jodnRDtI4g0AXDZNPyRWIdJ3m+tlYfsXu4mQAe0q+CyT+QrSRhPGI/llicF4x3rMbRBNqlDze/tFqp/jbgW84Puzz6KyxAez6gggOHMIIDgzCCAuygAwIBAgIBADANBgkqhkiG9w0BAQUFADCBjjELMAkGA1UEBhMCVVMxCzAJBgNVBAgTAkNBMRYwFAYDVQQHEw1Nb3VudGFpbiBWaWV3MRQwEgYDVQQKEwtQYXlQYWwgSW5jLjETMBEGA1UECxQKbGl2ZV9jZXJ0czERMA8GA1UEAxQIbGl2ZV9hcGkxHDAaBgkqhkiG9w0BCQEWDXJlQHBheXBhbC5jb20wHhcNMDQwMjEzMTAxMzE1WhcNMzUwMjEzMTAxMzE1WjCBjjELMAkGA1UEBhMCVVMxCzAJBgNVBAgTAkNBMRYwFAYDVQQHEw1Nb3VudGFpbiBWaWV3MRQwEgYDVQQKEwtQYXlQYWwgSW5jLjETMBEGA1UECxQKbGl2ZV9jZXJ0czERMA8GA1UEAxQIbGl2ZV9hcGkxHDAaBgkqhkiG9w0BCQEWDXJlQHBheXBhbC5jb20wgZ8wDQYJKoZIhvcNAQEBBQADgY0AMIGJAoGBAMFHTt38RMxLXJyO2SmS+Ndl72T7oKJ4u4uw+6awntALWh03PewmIJuzbALScsTS4sZoS1fKciBGoh11gIfHzylvkdNe/hJl66/RGqrj5rFb08sAABNTzDTiqqNpJeBsYs/c2aiGozptX2RlnBktH+SUNpAajW724Nv2Wvhif6sFAgMBAAGjge4wgeswHQYDVR0OBBYEFJaffLvGbxe9WT9S1wob7BDWZJRrMIG7BgNVHSMEgbMwgbCAFJaffLvGbxe9WT9S1wob7BDWZJRroYGUpIGRMIGOMQswCQYDVQQGEwJVUzELMAkGA1UECBMCQ0ExFjAUBgNVBAcTDU1vdW50YWluIFZpZXcxFDASBgNVBAoTC1BheVBhbCBJbmMuMRMwEQYDVQQLFApsaXZlX2NlcnRzMREwDwYDVQQDFAhsaXZlX2FwaTEcMBoGCSqGSIb3DQEJARYNcmVAcGF5cGFsLmNvbYIBADAMBgNVHRMEBTADAQH/MA0GCSqGSIb3DQEBBQUAA4GBAIFfOlaagFrl71+jq6OKidbWFSE+Q4FqROvdgIONth+8kSK//Y/4ihuE4Ymvzn5ceE3S/iBSQQMjyvb+s2TWbQYDwcp129OPIbD9epdr4tJOUNiSojw7BHwYRiPh58S1xGlFgHFXwrEBb3dgNbMUa+u4qectsMAXpVHnD9wIyfmHMYIBmjCCAZYCAQEwgZQwgY4xCzAJBgNVBAYTAlVTMQswCQYDVQQIEwJDQTEWMBQGA1UEBxMNTW91bnRhaW4gVmlldzEUMBIGA1UEChMLUGF5UGFsIEluYy4xEzARBgNVBAsUCmxpdmVfY2VydHMxETAPBgNVBAMUCGxpdmVfYXBpMRwwGgYJKoZIhvcNAQkBFg1yZUBwYXlwYWwuY29tAgEAMAkGBSsOAwIaBQCgXTAYBgkqhkiG9w0BCQMxCwYJKoZIhvcNAQcBMBwGCSqGSIb3DQEJBTEPFw0xNTA4MDUxMzMyMTRaMCMGCSqGSIb3DQEJBDEWBBRtGLYvbZ45sWVegWVP2CuXTHPmJTANBgkqhkiG9w0BAQEFAASBgKgrMHMINfV7yVuZgcTjp8gUzejPF2x2zRPU/G8pKUvYIl1F38TjV2pe4w0QXcGMJRT8mQfxHCy9UmF3LfblH8F0NSMMDrZqu3M0eLk96old+L0Xl6ING8l3idFDkLagE+lZK4A0rNV35aMci3VLvjQ34CvEj7jaHeLpbkgk/l6v-----END PKCS7-----
">
<input type="image" src="https://www.paypalobjects.com/en_US/i/btn/btn_donateCC_LG.gif" border="0" name="submit" alt="PayPal - The safer, easier way to pay online!">
<img alt="" border="0" src="https://www.paypalobjects.com/en_US/i/scr/pixel.gif" width="1" height="1">
</form>

-->

<hr>
<span class="sidebar_title">Sponsors</span>
<br/>

<a href="https://lambdalabs.com/?utm_source=neuralnetworksdeeplearning&utm_medium=banner&utm_campaign=blogin&utm_content=rbannerimg">
  <img src="assets/lambda.png" width="200px" style="padding: 3px 0px 0px 10px; border-style: none;">
</a>
<br>
<div style="line-height: 1.2; padding-bottom: 12px; font-size: 0.8;">
  <a href="https://lambdalabs.com/?utm_source=neuralnetworksdeeplearning&utm_medium=banner&utm_campaign=blogin&utm_content=rtext">Deep Learning Workstations, Servers, and Laptops</a>
</div>

<a href='http://gsquaredcapital.com/'><img src='assets/gsquared.png' width='200px' style="padding: 5px 0px 10px 10px; border-style: none;"></a>

<a href='http://www.tineye.com'><img src='assets/tineye.png' width='200px'
style="padding: 0px 0px 10px 8px; border-style: none;"></a>

<a href='http://www.visionsmarts.com'><img
src='assets/visionsmarts.png' width='210px' style="padding: 0px 0px
0px 0px; border-style: none;"></a> <br/> 

<p class="sidebar">Thanks to all the <a
href="supporters.html">supporters</a> who made the book possible, with
especial thanks to Pavel Dudrenov.  Thanks also to all the
contributors to the <a href="bugfinder.html">Bugfinder Hall of
Fame</a>.  </p>

<hr>
<span class="sidebar_title">Resources</span>

<p class="sidebar"><a href="https://twitter.com/michael_nielsen">Michael Nielsen on Twitter</a></p>

<p class="sidebar"><a href="faq.html">Book FAQ</a></p>

<p class="sidebar">
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning">Code repository</a></p>

<p class="sidebar">
<a href="https://michaelnielsenupdates.substack.com/subscribe">Michael Nielsen's project announcement mailing list</a>
</p>

<p class="sidebar"> <a href="http://www.deeplearningbook.org/">Deep Learning</a>, book by Ian
Goodfellow, Yoshua Bengio, and Aaron Courville</p>

<p class="sidebar"><a href="http://cognitivemedium.com">cognitivemedium.com</a></p>

<hr>
<a href="http://michaelnielsen.org"><img src="assets/Michael_Nielsen_Web_Small.jpg" width="160px" style="border-style: none;"/></a>

<p class="sidebar">
By <a href="http://michaelnielsen.org">Michael Nielsen</a> / Dec 2019
</p>
</div>
</p><p>In the <a href="chap1.html">last chapter</a> we saw how neural networks canlearn their weights and biases using the gradient descent algorithm.There was, however, a gap in our explanation: we didn't discuss how tocompute the gradient of the cost function.  That's quite a gap!  Inthis chapter I'll explain a fast algorithm for computing suchgradients, an algorithm known as <em>backpropagation</em>.  </p><p>The backpropagation algorithm was originally introduced in the 1970s,but its importance wasn't fully appreciated until a<a href="http://www.nature.com/nature/journal/v323/n6088/pdf/323533a0.pdf">famous  1986 paper</a> by<a href="http://en.wikipedia.org/wiki/David_Rumelhart">David  Rumelhart</a>,<a href="http://www.cs.toronto.edu/&#126;hinton/">Geoffrey  Hinton</a>, and<a href="http://en.wikipedia.org/wiki/Ronald_J._Williams">Ronald  Williams</a>.  That paper describes severalneural networks where backpropagation works far faster than earlierapproaches to learning, making it possible to use neural nets to solveproblems which had previously been insoluble.  Today, thebackpropagation algorithm is the workhorse of learning in neuralnetworks.</p><p>This chapter is more mathematically involved than the rest of thebook.  If you're not crazy about mathematics you may be tempted toskip the chapter, and to treat backpropagation as a black box whosedetails you're willing to ignore.  Why take the time to study thosedetails?</p><p>The reason, of course, is understanding.  At the heart ofbackpropagation is an expression for the partial derivative $\partialC / \partial w$ of the cost function $C$ with respect to any weight$w$ (or bias $b$) in the network.  The expression tells us how quicklythe cost changes when we change the weights and biases.  And while theexpression is somewhat complex, it also has a beauty to it, with eachelement having a natural, intuitive interpretation.  And sobackpropagation isn't just a fast algorithm for learning.  It actuallygives us detailed insights into how changing the weights and biaseschanges the overall behaviour of the network.  That's well worthstudying in detail.</p><p>With that said, if you want to skim the chapter, or jump straight tothe next chapter, that's fine.  I've written the rest of the book tobe accessible even if you treat backpropagation as a black box.  Thereare, of course, points later in the book where I refer back to resultsfrom this chapter.  But at those points you should still be able tounderstand the main conclusions, even if you don't follow all thereasoning.</p><p><h3><a name="warm_up_a_fast_matrix-based_approach_to_computing_the_output_from_a_neural_network"></a><a href="chap2.html#warm_up_a_fast_matrix-based_approach_to_computing_the_output_from_a_neural_network">Warm up: a fast matrix-based approach to computing the output  from a neural network</a></h3></p><p>Before discussing backpropagation, let's warm up with a fastmatrix-based algorithm to compute the output from a neural network.We actually already briefly saw this algorithm<a href="chap1.html#implementing_our_network_to_classify_digits">near  the end of the last chapter</a>, but I described it quickly, so it'sworth revisiting in detail.  In particular, this is a good way ofgetting comfortable with the notation used in backpropagation, in afamiliar context.</p><p>Let's begin with a notation which lets us refer to weights in thenetwork in an unambiguous way.  We'll use $w^l_{jk}$ to denote theweight for the connection from the $k^{\rm th}$ neuron in the$(l-1)^{\rm th}$ layer to the $j^{\rm th}$ neuron in the $l^{\rm th}$layer.  So, for example, the diagram below shows the weight on aconnection from the fourth neuron in the second layer to the secondneuron in the third layer of a network:<center><img src="images/tikz16.png"/></center>This notation is cumbersome at first, and it does take some work tomaster.  But with a little effort you'll find the notation becomeseasy and natural.  One quirk of the notation is the ordering of the$j$ and $k$ indices.  You might think that it makes more sense to use$j$ to refer to the input neuron, and $k$ to the output neuron, notvice versa, as is actually done.  I'll explain the reason for thisquirk below.</p><p>We use a similar notation for the network's biases and activations.Explicitly, we use $b^l_j$ for the bias of the $j^{\rm th}$ neuron inthe $l^{\rm th}$ layer.  And we use $a^l_j$ for the activation of the$j^{\rm th}$ neuron in the $l^{\rm th}$ layer.  The following diagramshows examples of these notations in use:<center><img src="images/tikz17.png"/></center>With these notations, the activation $a^{l}_j$ of the $j^{\rm th}$neuron in the $l^{\rm th}$ layer is related to the activations in the$(l-1)^{\rm th}$ layer by the equation (compareEquation <span id="margin_894765571383_reveal" class="equation_link">(4)</span><span id="margin_894765571383" class="marginequation" style="display: none;"><a href="chap1.html#eqtn4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \frac{1}{1+\exp(-\sum_j w_j x_j-b)} \nonumber\end{eqnarray}</a></span><script>$('#margin_894765571383_reveal').click(function() {$('#margin_894765571383').toggle('slow', function() {});});</script> and surroundingdiscussion in the last chapter)<a class="displaced_anchor" name="eqtn23"></a>\begin{eqnarray}   a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right),\tag{23}\end{eqnarray}where the sum is over all neurons $k$ in the $(l-1)^{\rm th}$ layer.  Torewrite this expression in a matrix form we define a <em>weight  matrix</em> $w^l$ for each layer, $l$.  The entries of the weight matrix$w^l$ are just the weights connecting to the $l^{\rm th}$ layer of neurons,that is, the entry in the $j^{\rm th}$ row and $k^{\rm th}$ column is $w^l_{jk}$.Similarly, for each layer $l$ we define a <em>bias vector</em>, $b^l$.You can probably guess how this works - the components of the biasvector are just the values $b^l_j$, one component for each neuron inthe $l^{\rm th}$ layer.  And finally, we define an activation vector $a^l$whose components are the activations $a^l_j$.</p><p>The last ingredient we need to rewrite <span id="margin_125223678996_reveal" class="equation_link">(23)</span><span id="margin_125223678996" class="marginequation" style="display: none;"><a href="chap2.html#eqtn23" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right) \nonumber\end{eqnarray}</a></span><script>$('#margin_125223678996_reveal').click(function() {$('#margin_125223678996').toggle('slow', function() {});});</script> in amatrix form is the idea of vectorizing a function such as $\sigma$.We met vectorization briefly in the last chapter, but to recap, theidea is that we want to apply a function such as $\sigma$ to everyelement in a vector $v$.  We use the obvious notation $\sigma(v)$ todenote this kind of elementwise application of a function.  That is,the components of $\sigma(v)$ are just $\sigma(v)_j = \sigma(v_j)$.As an example, if we have the function $f(x) = x^2$ then thevectorized form of $f$ has the effect<a class="displaced_anchor" name="eqtn24"></a>\begin{eqnarray}  f\left(\left[ \begin{array}{c} 2 \\ 3 \end{array} \right] \right)  = \left[ \begin{array}{c} f(2) \\ f(3) \end{array} \right]  = \left[ \begin{array}{c} 4 \\ 9 \end{array} \right],\tag{24}\end{eqnarray}that is, the vectorized $f$ just squares every element of the vector.</p><p>With these notations in mind, Equation <span id="margin_3155761849_reveal" class="equation_link">(23)</span><span id="margin_3155761849" class="marginequation" style="display: none;"><a href="chap2.html#eqtn23" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right) \nonumber\end{eqnarray}</a></span><script>$('#margin_3155761849_reveal').click(function() {$('#margin_3155761849').toggle('slow', function() {});});</script> canbe rewritten in the beautiful and compact vectorized form<a class="displaced_anchor" name="eqtn25"></a>\begin{eqnarray}   a^{l} = \sigma(w^l a^{l-1}+b^l).\tag{25}\end{eqnarray}This expression gives us a much more global way of thinking about howthe activations in one layer relate to activations in the previouslayer: we just apply the weight matrix to the activations, then addthe bias vector, and finally apply the $\sigma$ function<a  id="quirk"></a>*<span class="marginnote">*By the way, it's this expression that  motivates the quirk in the $w^l_{jk}$ notation mentioned earlier.  If we used $j$ to index the input neuron, and $k$ to index the  output neuron, then we'd need to replace the weight matrix in  Equation <span id="margin_940105425902_reveal" class="equation_link">(25)</span><span id="margin_940105425902" class="marginequation" style="display: none;"><a href="chap2.html#eqtn25" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   a^{l} = \sigma(w^l a^{l-1}+b^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_940105425902_reveal').click(function() {$('#margin_940105425902').toggle('slow', function() {});});</script> by the transpose of the  weight matrix.  That's a small change, but annoying, and we'd lose  the easy simplicity of saying (and thinking) "apply the weight  matrix to the activations".</span>.  That global view is often easier andmore succinct (and involves fewer indices!) than the neuron-by-neuronview we've taken to now.  Think of it as a way of escaping index hell,while remaining precise about what's going on.  The expression is alsouseful in practice, because most matrix libraries provide fast ways ofimplementing matrix multiplication, vector addition, andvectorization.  Indeed, the<a href="chap1.html#implementing_our_network_to_classify_digits">code</a>in the last chapter made implicit use of this expression to computethe behaviour of the network.</p><p>When using Equation <span id="margin_474495386882_reveal" class="equation_link">(25)</span><span id="margin_474495386882" class="marginequation" style="display: none;"><a href="chap2.html#eqtn25" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   a^{l} = \sigma(w^l a^{l-1}+b^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_474495386882_reveal').click(function() {$('#margin_474495386882').toggle('slow', function() {});});</script> to compute $a^l$,we compute the intermediate quantity $z^l \equiv w^l a^{l-1}+b^l$along the way.  This quantity turns out to be useful enough to beworth naming: we call $z^l$ the <em>weighted input</em> to the neuronsin layer $l$.  We'll make considerable use of the weighted input $z^l$later in the chapter.  Equation <span id="margin_711045846698_reveal" class="equation_link">(25)</span><span id="margin_711045846698" class="marginequation" style="display: none;"><a href="chap2.html#eqtn25" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   a^{l} = \sigma(w^l a^{l-1}+b^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_711045846698_reveal').click(function() {$('#margin_711045846698').toggle('slow', function() {});});</script> issometimes written in terms of the weighted input, as $a^l =\sigma(z^l)$.  It's also worth noting that $z^l$ has components $z^l_j= \sum_k w^l_{jk} a^{l-1}_k+b^l_j$, that is, $z^l_j$ is just theweighted input to the activation function for neuron $j$ in layer $l$.</p><p><h3><a name="the_two_assumptions_we_need_about_the_cost_function"></a><a href="chap2.html#the_two_assumptions_we_need_about_the_cost_function">The two assumptions we need about the cost function</a></h3></p><p>The goal of backpropagation is to compute the partial derivatives$\partial C / \partial w$ and $\partial C / \partial b$ of the costfunction $C$ with respect to any weight $w$ or bias $b$ in thenetwork.  For backpropagation to work we need to make two mainassumptions about the form of the cost function.  Before stating thoseassumptions, though, it's useful to have an example cost function inmind.  We'll use the quadratic cost function from last chapter(c.f. Equation <span id="margin_830124720181_reveal" class="equation_link">(6)</span><span id="margin_830124720181" class="marginequation" style="display: none;"><a href="chap1.html#eqtn6" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  C(w,b) \equiv  \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}</a></span><script>$('#margin_830124720181_reveal').click(function() {$('#margin_830124720181').toggle('slow', function() {});});</script>).  In the notation ofthe last section, the quadratic cost has the form<a class="displaced_anchor" name="eqtn26"></a>\begin{eqnarray}  C = \frac{1}{2n} \sum_x \|y(x)-a^L(x)\|^2,\tag{26}\end{eqnarray}where: $n$ is the total number of training examples; the sum is overindividual training examples, $x$; $y = y(x)$ is the correspondingdesired output; $L$ denotes the number of layers in the network; and$a^L = a^L(x)$ is the vector of activations output from the networkwhen $x$ is input.</p><p>Okay, so what assumptions do we need to make about our cost function,$C$, in order that backpropagation can be applied?  The firstassumption we need is that the cost function can be written as anaverage $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ forindividual training examples, $x$.  This is the case for the quadraticcost function, where the cost for a single training example is $C_x =\frac{1}{2} \|y-a^L \|^2$.  This assumption will also hold true forall the other cost functions we'll meet in this book.</p><p>The reason we need this assumption is because what backpropagationactually lets us do is compute the partial derivatives $\partial C_x/ \partial w$ and $\partial C_x / \partial b$ for a single trainingexample.  We then recover $\partial C / \partial w$ and $\partial C/ \partial b$ by averaging over training examples.  In fact, with thisassumption in mind, we'll suppose the training example $x$ has beenfixed, and drop the $x$ subscript, writing the cost $C_x$ as $C$.We'll eventually put the $x$ back in, but for now it's a notationalnuisance that is better left implicit.</p><p>The second assumption we make about the cost is that it can be writtenas a function of the outputs from the neural network:<center><img src="images/tikz18.png"/></center>For example, the quadratic cost function satisfies this requirement,since the quadratic cost for a single training example $x$ may bewritten as<a class="displaced_anchor" name="eqtn27"></a>\begin{eqnarray}  C = \frac{1}{2} \|y-a^L\|^2 = \frac{1}{2} \sum_j (y_j-a^L_j)^2,\tag{27}\end{eqnarray}and thus is a function of the output activations.  Of course, thiscost function also depends on the desired output $y$, and you maywonder why we're not regarding the cost also as a function of $y$.Remember, though, that the input training example $x$ is fixed, and sothe output $y$ is also a fixed parameter.  In particular, it's notsomething we can modify by changing the weights and biases in any way,i.e., it's not something which the neural network learns.  And so itmakes sense to regard $C$ as a function of the output activations$a^L$ alone, with $y$ merely a parameter that helps define thatfunction.</p><p></p><p></p><p></p><p><h3><a name="the_hadamard_product_$s_\odot_t$"></a><a href="chap2.html#the_hadamard_product_$s_\odot_t$">The Hadamard product, $s \odot t$</a></h3></p><p>The backpropagation algorithm is based on common linear algebraicoperations - things like vector addition, multiplying a vector by amatrix, and so on.  But one of the operations is a little lesscommonly used.  In particular, suppose $s$ and $t$ are two vectors ofthe same dimension.  Then we use $s \odot t$ to denote the<em>elementwise</em> product of the two vectors.  Thus the components of$s \odot t$ are just $(s \odot t)_j = s_j t_j$.  As an example,<a class="displaced_anchor" name="eqtn28"></a>\begin{eqnarray}\left[\begin{array}{c} 1 \\ 2 \end{array}\right]   \odot \left[\begin{array}{c} 3 \\ 4\end{array} \right]= \left[ \begin{array}{c} 1 * 3 \\ 2 * 4 \end{array} \right]= \left[ \begin{array}{c} 3 \\ 8 \end{array} \right].\tag{28}\end{eqnarray}This kind of elementwise multiplication is sometimes called the<em>Hadamard product</em> or <em>Schur product</em>.  We'll refer to it asthe Hadamard product.  Good matrix libraries usually provide fastimplementations of the Hadamard product, and that comes in handy whenimplementing backpropagation.</p><p><h3><a name="the_four_fundamental_equations_behind_backpropagation"></a><a href="chap2.html#the_four_fundamental_equations_behind_backpropagation">The four fundamental equations behind backpropagation</a></h3></p><p>Backpropagation is about understanding how changing the weights andbiases in a network changes the cost function.  Ultimately, this meanscomputing the partial derivatives $\partial C / \partial w^l_{jk}$ and$\partial C / \partial b^l_j$.  But to compute those, we firstintroduce an intermediate quantity, $\delta^l_j$, which we call the<em>error</em> in the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer.Backpropagation will give us a procedure to compute the error$\delta^l_j$, and then will relate $\delta^l_j$ to $\partial C/ \partial w^l_{jk}$ and $\partial C / \partial b^l_j$.</p><p>To understand how the error is defined, imagine there is a demon inour neural network:<center><img src="images/tikz19.png"/></center>The demon sits at the $j^{\rm th}$ neuron in layer $l$.  As the input to theneuron comes in, the demon messes with the neuron's operation.  Itadds a little change $\Delta z^l_j$ to the neuron's weighted input, sothat instead of outputting $\sigma(z^l_j)$, the neuron instead outputs$\sigma(z^l_j+\Delta z^l_j)$.  This change propagates through laterlayers in the network, finally causing the overall cost to change byan amount $\frac{\partial C}{\partial z^l_j} \Delta z^l_j$.</p><p>Now, this demon is a good demon, and is trying to help you improve thecost, i.e., they're trying to find a $\Delta z^l_j$ which makes thecost smaller.  Suppose $\frac{\partial C}{\partial z^l_j}$ has a largevalue (either positive or negative).  Then the demon can lower thecost quite a bit by choosing $\Delta z^l_j$ to have the opposite signto $\frac{\partial C}{\partial z^l_j}$.  By contrast, if$\frac{\partial C}{\partial z^l_j}$ is close to zero, then the demoncan't improve the cost much at all by perturbing the weighted input$z^l_j$.  So far as the demon can tell, the neuron is already prettynear optimal*<span class="marginnote">*This is only the case for small changes $\Delta  z^l_j$, of course. We'll assume that the demon is constrained to  make such small changes.</span>.  And so there's a heuristic sense inwhich $\frac{\partial C}{\partial z^l_j}$ is a measure of the error inthe neuron.</p><p>Motivated by this story, we define the error $\delta^l_j$ of neuron$j$ in layer $l$ by<a class="displaced_anchor" name="eqtn29"></a>\begin{eqnarray}   \delta^l_j \equiv \frac{\partial C}{\partial z^l_j}.\tag{29}\end{eqnarray}As per our usual conventions, we use $\delta^l$ to denote the vectorof errors associated with layer $l$.  Backpropagation will give us away of computing $\delta^l$ for every layer, and then relating thoseerrors to the quantities of real interest, $\partial C / \partialw^l_{jk}$ and $\partial C / \partial b^l_j$.</p><p>You might wonder why the demon is changing the weighted input $z^l_j$.Surely it'd be more natural to imagine the demon changing the outputactivation $a^l_j$, with the result that we'd be using $\frac{\partial  C}{\partial a^l_j}$ as our measure of error.  In fact, if you dothis things work out quite similarly to the discussion below.  But itturns out to make the presentation of backpropagation a little morealgebraically complicated.  So we'll stick with $\delta^l_j =\frac{\partial C}{\partial z^l_j}$ as our measure of error*<span class="marginnote">*In  classification problems like MNIST the term "error" is sometimes  used to mean the classification failure rate.  E.g., if the neural  net correctly classifies 96.0 percent of the digits, then the error  is 4.0 percent.  Obviously, this has quite a different meaning from  our $\delta$ vectors.  In practice, you shouldn't have trouble  telling which meaning is intended in any given usage.</span>.</p><p><strong>Plan of attack:</strong> Backpropagation is based around fourfundamental equations.  Together, those equations give us a way ofcomputing both the error $\delta^l$ and the gradient of the costfunction.  I state the four equations below.  Be warned, though: youshouldn't expect to instantaneously assimilate the equations.  Such anexpectation will lead to disappointment.  In fact, the backpropagationequations are so rich that understanding them well requiresconsiderable time and patience as you gradually delve deeper into theequations.  The good news is that such patience is repaid many timesover.  And so the discussion in this section is merely a beginning,helping you on the way to a thorough understanding of the equations.</p><p>Here's a preview of the ways we'll delve more deeply into theequations later in the chapter: I'll<a href="chap2.html#proof_of_the_four_fundamental_equations_(optional)">give  a short proof of the equations</a>, which helps explain why they aretrue; we'll <a href="chap2.html#the_backpropagation_algorithm">restate  the equations</a> in algorithmic form as pseudocode, and<a href="chap2.html#the_code_for_backpropagation">see how</a> thepseudocode can be implemented as real, running Python code; and, in<a href="chap2.html#backpropagation_the_big_picture">the final  section of the chapter</a>, we'll develop an intuitive picture of whatthe backpropagation equations mean, and how someone might discoverthem from scratch.  Along the way we'll return repeatedly to the fourfundamental equations, and as you deepen your understanding thoseequations will come to seem comfortable and, perhaps, even beautifuland natural.</p><p><strong>An equation for the error in the output layer, $\delta^L$:</strong>The components of $\delta^L$ are given by<a class="displaced_anchor" name="eqtnBP1"></a>\begin{eqnarray}   \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j).\tag{BP1}\end{eqnarray}This is a very natural expression.  The first term on the right,$\partial C / \partial a^L_j$, just measures how fast the cost ischanging as a function of the $j^{\rm th}$ output activation.  If, forexample, $C$ doesn't depend much on a particular output neuron, $j$,then $\delta^L_j$ will be small, which is what we'd expect.  Thesecond term on the right, $\sigma'(z^L_j)$, measures how fast theactivation function $\sigma$ is changing at $z^L_j$.</p><p>Notice that everything in <span id="margin_746332519947_reveal" class="equation_link">(BP1)</span><span id="margin_746332519947" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_746332519947_reveal').click(function() {$('#margin_746332519947').toggle('slow', function() {});});</script> is easily computed.  Inparticular, we compute $z^L_j$ while computing the behaviour of thenetwork, and it's only a small additional overhead to compute$\sigma'(z^L_j)$.  The exact form of $\partial C / \partial a^L_j$will, of course, depend on the form of the cost function.  However,provided the cost function is known there should be little troublecomputing $\partial C / \partial a^L_j$.  For example, if we're usingthe quadratic cost function then $C = \frac{1}{2} \sum_j(y_j-a^L_j)^2$, and so $\partial C / \partial a^L_j = (a_j^L-y_j)$,which obviously is easily computable.</p><p>Equation <span id="margin_55122493473_reveal" class="equation_link">(BP1)</span><span id="margin_55122493473" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_55122493473_reveal').click(function() {$('#margin_55122493473').toggle('slow', function() {});});</script> is a componentwise expression for $\delta^L$.It's a perfectly good expression, but not the matrix-based form wewant for backpropagation. However, it's easy to rewrite the equationin a matrix-based form, as<a class="displaced_anchor" name="eqtnBP1a"></a>\begin{eqnarray}   \delta^L = \nabla_a C \odot \sigma'(z^L).\tag{BP1a}\end{eqnarray}Here, $\nabla_a C$ is defined to be a vector whose components are thepartial derivatives $\partial C / \partial a^L_j$.  You can think of$\nabla_a C$ as expressing the rate of change of $C$ with respect tothe output activations.  It's easy to see that Equations <span id="margin_694778847298_reveal" class="equation_link">(BP1a)</span><span id="margin_694778847298" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1a" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^L = \nabla_a C \odot \sigma'(z^L) \nonumber\end{eqnarray}</a></span><script>$('#margin_694778847298_reveal').click(function() {$('#margin_694778847298').toggle('slow', function() {});});</script>and <span id="margin_814281327209_reveal" class="equation_link">(BP1)</span><span id="margin_814281327209" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_814281327209_reveal').click(function() {$('#margin_814281327209').toggle('slow', function() {});});</script> are equivalent, and for that reason from now on we'lluse <span id="margin_347609596965_reveal" class="equation_link">(BP1)</span><span id="margin_347609596965" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_347609596965_reveal').click(function() {$('#margin_347609596965').toggle('slow', function() {});});</script> interchangeably to refer to both equations.  As anexample, in the case of the quadratic cost we have $\nabla_a C =(a^L-y)$, and so the fully matrix-based form of <span id="margin_580202085304_reveal" class="equation_link">(BP1)</span><span id="margin_580202085304" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_580202085304_reveal').click(function() {$('#margin_580202085304').toggle('slow', function() {});});</script> becomes<a class="displaced_anchor" name="eqtn30"></a>\begin{eqnarray}   \delta^L = (a^L-y) \odot \sigma'(z^L).\tag{30}\end{eqnarray}As you can see, everything in this expression has a nice vector form,and is easily computed using a library such as Numpy.</p><p><strong>An equation for the error $\delta^l$ in terms of the error in  the next layer, $\delta^{l+1}$:</strong> In particular<a class="displaced_anchor" name="eqtnBP2"></a>\begin{eqnarray}   \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l),\tag{BP2}\end{eqnarray}where $(w^{l+1})^T$ is the transpose of the weight matrix $w^{l+1}$ forthe $(l+1)^{\rm th}$ layer.  This equation appears complicated, buteach element has a nice interpretation.  Suppose we know the error$\delta^{l+1}$ at the $l+1^{\rm th}$ layer.  When we apply thetranspose weight matrix, $(w^{l+1})^T$, we can think intuitively ofthis as moving the error <em>backward</em> through the network, givingus some sort of measure of the error at the output of the $l^{\rm th}$layer.  We then take the Hadamard product $\odot \sigma'(z^l)$.  Thismoves the error backward through the activation function in layer $l$,giving us the error $\delta^l$ in the weighted input to layer $l$.</p><p>By combining <span id="margin_660897633688_reveal" class="equation_link">(BP2)</span><span id="margin_660897633688" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_660897633688_reveal').click(function() {$('#margin_660897633688').toggle('slow', function() {});});</script> with <span id="margin_609133724292_reveal" class="equation_link">(BP1)</span><span id="margin_609133724292" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_609133724292_reveal').click(function() {$('#margin_609133724292').toggle('slow', function() {});});</script> we can compute the error$\delta^l$ for any layer in the network.  We start byusing <span id="margin_715590989913_reveal" class="equation_link">(BP1)</span><span id="margin_715590989913" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_715590989913_reveal').click(function() {$('#margin_715590989913').toggle('slow', function() {});});</script> to compute $\delta^L$, then applyEquation <span id="margin_340221995720_reveal" class="equation_link">(BP2)</span><span id="margin_340221995720" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_340221995720_reveal').click(function() {$('#margin_340221995720').toggle('slow', function() {});});</script> to compute $\delta^{L-1}$, thenEquation <span id="margin_421054663689_reveal" class="equation_link">(BP2)</span><span id="margin_421054663689" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_421054663689_reveal').click(function() {$('#margin_421054663689').toggle('slow', function() {});});</script> again to compute $\delta^{L-2}$, and so on, allthe way back through the network.</p><p><strong>An equation for the rate of change of the cost with respect to  any bias in the network:</strong> In particular:<a class="displaced_anchor" name="eqtnBP3"></a>\begin{eqnarray}  \frac{\partial C}{\partial b^l_j} =  \delta^l_j.\tag{BP3}\end{eqnarray}That is, the error $\delta^l_j$ is <em>exactly equal</em> to the rate ofchange $\partial C / \partial b^l_j$.  This is great news, since<span id="margin_783953737713_reveal" class="equation_link">(BP1)</span><span id="margin_783953737713" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_783953737713_reveal').click(function() {$('#margin_783953737713').toggle('slow', function() {});});</script> and <span id="margin_596905285954_reveal" class="equation_link">(BP2)</span><span id="margin_596905285954" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_596905285954_reveal').click(function() {$('#margin_596905285954').toggle('slow', function() {});});</script> have already told us how to compute$\delta^l_j$.  We can rewrite <span id="margin_260441376709_reveal" class="equation_link">(BP3)</span><span id="margin_260441376709" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP3" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial C}{\partial b^l_j} =  \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_260441376709_reveal').click(function() {$('#margin_260441376709').toggle('slow', function() {});});</script> in shorthand as<a class="displaced_anchor" name="eqtn31"></a>\begin{eqnarray}  \frac{\partial C}{\partial b} = \delta,\tag{31}\end{eqnarray}where it is understood that $\delta$ is being evaluated at the sameneuron as the bias $b$.</p><p><strong>An equation for the rate of change of the cost with respect to  any weight in the network:</strong>  In particular:<a class="displaced_anchor" name="eqtnBP4"></a>\begin{eqnarray}    \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j.\tag{BP4}\end{eqnarray}This tells us how to compute the partial derivatives $\partial C/ \partial w^l_{jk}$ in terms of the quantities $\delta^l$ and$a^{l-1}$, which we already know how to compute.  The equation can berewritten in a less index-heavy notation as<a class="displaced_anchor" name="eqtn32"></a>\begin{eqnarray}  \frac{\partial    C}{\partial w} = a_{\rm in} \delta_{\rm out},\tag{32}\end{eqnarray}where it's understood that $a_{\rm in}$ is the activation of theneuron input to the weight $w$, and $\delta_{\rm out}$ is the error ofthe neuron output from the weight $w$.  Zooming in to look at just theweight $w$, and the two neurons connected by that weight, we candepict this as:<center><img src="images/tikz20.png"/></center>A nice consequence of Equation <span id="margin_762340755973_reveal" class="equation_link">(32)</span><span id="margin_762340755973" class="marginequation" style="display: none;"><a href="chap2.html#eqtn32" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial    C}{\partial w} = a_{\rm in} \delta_{\rm out} \nonumber\end{eqnarray}</a></span><script>$('#margin_762340755973_reveal').click(function() {$('#margin_762340755973').toggle('slow', function() {});});</script> isthat when the activation $a_{\rm in}$ is small, $a_{\rm in} \approx0$, the gradient term $\partial C / \partial w$ will also tend to besmall.  In this case, we'll say the weight <em>learns slowly</em>,meaning that it's not changing much during gradient descent.  In otherwords, one consequence of <span id="margin_396572057122_reveal" class="equation_link">(BP4)</span><span id="margin_396572057122" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}    \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_396572057122_reveal').click(function() {$('#margin_396572057122').toggle('slow', function() {});});</script> is that weights output fromlow-activation neurons learn slowly.</p><p><a name="saturation"></a></p><p>There are other insights along these lines which can be obtainedfrom <span id="margin_388896414764_reveal" class="equation_link">(BP1)</span><span id="margin_388896414764" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_388896414764_reveal').click(function() {$('#margin_388896414764').toggle('slow', function() {});});</script>-<span id="margin_659805817841_reveal" class="equation_link">(BP4)</span><span id="margin_659805817841" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}    \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_659805817841_reveal').click(function() {$('#margin_659805817841').toggle('slow', function() {});});</script>.  Let's start by looking at the outputlayer.  Consider the term $\sigma'(z^L_j)$ in <span id="margin_593318415394_reveal" class="equation_link">(BP1)</span><span id="margin_593318415394" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_593318415394_reveal').click(function() {$('#margin_593318415394').toggle('slow', function() {});});</script>.  Recallfrom the <a href="chap1.html#sigmoid_graph">graph of the sigmoid  function in the last chapter</a> that the $\sigma$ function becomesvery flat when $\sigma(z^L_j)$ is approximately $0$ or $1$.  When thisoccurs we will have $\sigma'(z^L_j) \approx 0$.  And so the lesson isthat a weight in the final layer will learn slowly if the outputneuron is either low activation ($\approx 0$) or high activation($\approx 1$).  In this case it's common to say the output neuron has<em>saturated</em> and, as a result, the weight has stopped learning (oris learning slowly).  Similar remarks hold also for the biases ofoutput neuron.</p><p>We can obtain similar insights for earlier layers.  In particular,note the $\sigma'(z^l)$ term in <span id="margin_190724977362_reveal" class="equation_link">(BP2)</span><span id="margin_190724977362" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_190724977362_reveal').click(function() {$('#margin_190724977362').toggle('slow', function() {});});</script>.  This means that$\delta^l_j$ is likely to get small if the neuron is near saturation.And this, in turn, means that any weights input to a saturated neuronwill learn slowly*<span class="marginnote">*This reasoning won't hold if ${w^{l+1}}^T  \delta^{l+1}$ has large enough entries to compensate for the  smallness of $\sigma'(z^l_j)$.  But I'm speaking of the general  tendency.</span>.</p><p>Summing up, we've learnt that a weight will learn slowly if either theinput neuron is low-activation, or if the output neuron has saturated,i.e., is either high- or low-activation.  </p><p>None of these observations is too greatly surprising.  Still, theyhelp improve our mental model of what's going on as a neural networklearns.  Furthermore, we can turn this type of reasoning around.  Thefour fundamental equations turn out to hold for any activationfunction, not just the standard sigmoid function (that's because, aswe'll see in a moment, the proofs don't use any special properties of$\sigma$).  And so we can use these equations to <em>design</em>activation functions which have particular desired learningproperties.  As an example to give you the idea, suppose we were tochoose a (non-sigmoid) activation function $\sigma$ so that $\sigma'$is always positive, and never gets close to zero.  That would preventthe slow-down of learning that occurs when ordinary sigmoid neuronssaturate.  Later in the book we'll see examples where this kind ofmodification is made to the activation function.  Keeping the fourequations <span id="margin_317805659118_reveal" class="equation_link">(BP1)</span><span id="margin_317805659118" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_317805659118_reveal').click(function() {$('#margin_317805659118').toggle('slow', function() {});});</script>-<span id="margin_281235820275_reveal" class="equation_link">(BP4)</span><span id="margin_281235820275" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}    \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_281235820275_reveal').click(function() {$('#margin_281235820275').toggle('slow', function() {});});</script> in mind can help explain why suchmodifications are tried, and what impact they can have.</p><p><a name="backpropsummary"></a></p><p><center><img src="images/tikz21.png"/></center></p><p><a id="alternative_backprop"></a></p><p><h4><a name="problem_543309"></a><a href="chap2.html#problem_543309">Problem</a></h4><ul><li><strong>Alternate presentation of the equations of backpropagation:</strong>  I've stated the equations of backpropagation (notably <span id="margin_369544886146_reveal" class="equation_link">(BP1)</span><span id="margin_369544886146" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_369544886146_reveal').click(function() {$('#margin_369544886146').toggle('slow', function() {});});</script>  and <span id="margin_411658597882_reveal" class="equation_link">(BP2)</span><span id="margin_411658597882" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_411658597882_reveal').click(function() {$('#margin_411658597882').toggle('slow', function() {});});</script>) using the Hadamard product.  This presentation may  be disconcerting if you're unused to the Hadamard product.  There's  an alternative approach, based on conventional matrix  multiplication, which some readers may find enlightening.  (1) Show  that <span id="margin_560817640248_reveal" class="equation_link">(BP1)</span><span id="margin_560817640248" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_560817640248_reveal').click(function() {$('#margin_560817640248').toggle('slow', function() {});});</script> may be rewritten as  <a class="displaced_anchor" name="eqtn33"></a>\begin{eqnarray}    \delta^L = \Sigma'(z^L) \nabla_a C,  \tag{33}\end{eqnarray}  where $\Sigma'(z^L)$ is a square matrix whose diagonal entries are  the values $\sigma'(z^L_j)$, and whose off-diagonal entries are  zero.  Note that this matrix acts on $\nabla_a C$ by conventional  matrix multiplication.  (2) Show that <span id="margin_411791280167_reveal" class="equation_link">(BP2)</span><span id="margin_411791280167" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_411791280167_reveal').click(function() {$('#margin_411791280167').toggle('slow', function() {});});</script> may be rewritten  as  <a class="displaced_anchor" name="eqtn34"></a>\begin{eqnarray}    \delta^l = \Sigma'(z^l) (w^{l+1})^T \delta^{l+1}.  \tag{34}\end{eqnarray}  (3) By combining observations (1) and (2) show that  <a class="displaced_anchor" name="eqtn35"></a>\begin{eqnarray}    \delta^l = \Sigma'(z^l) (w^{l+1})^T \ldots \Sigma'(z^{L-1}) (w^L)^T     \Sigma'(z^L) \nabla_a C  \tag{35}\end{eqnarray}  For readers comfortable with matrix multiplication this equation may  be easier to understand than <span id="margin_526265018367_reveal" class="equation_link">(BP1)</span><span id="margin_526265018367" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_526265018367_reveal').click(function() {$('#margin_526265018367').toggle('slow', function() {});});</script> and <span id="margin_838150496099_reveal" class="equation_link">(BP2)</span><span id="margin_838150496099" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_838150496099_reveal').click(function() {$('#margin_838150496099').toggle('slow', function() {});});</script>.  The  reason I've focused on <span id="margin_483851951417_reveal" class="equation_link">(BP1)</span><span id="margin_483851951417" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_483851951417_reveal').click(function() {$('#margin_483851951417').toggle('slow', function() {});});</script> and <span id="margin_275042918127_reveal" class="equation_link">(BP2)</span><span id="margin_275042918127" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_275042918127_reveal').click(function() {$('#margin_275042918127').toggle('slow', function() {});});</script> is because that  approach turns out to be faster to implement numerically.</ul></p><p><h3><a name="proof_of_the_four_fundamental_equations_(optional)"></a><a href="chap2.html#proof_of_the_four_fundamental_equations_(optional)">Proof of the four fundamental equations (optional)</a></h3> </p><p>We'll now prove the four fundamentalequations <span id="margin_571954844130_reveal" class="equation_link">(BP1)</span><span id="margin_571954844130" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_571954844130_reveal').click(function() {$('#margin_571954844130').toggle('slow', function() {});});</script>-<span id="margin_804504789601_reveal" class="equation_link">(BP4)</span><span id="margin_804504789601" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}    \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_804504789601_reveal').click(function() {$('#margin_804504789601').toggle('slow', function() {});});</script>.  All four are consequences of thechain rule from multivariable calculus.  If you're comfortable withthe chain rule, then I strongly encourage you to attempt thederivation yourself before reading on.</p><p>Let's begin with Equation <span id="margin_611970553467_reveal" class="equation_link">(BP1)</span><span id="margin_611970553467" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_611970553467_reveal').click(function() {$('#margin_611970553467').toggle('slow', function() {});});</script>, which gives an expression forthe output error, $\delta^L$.  To prove this equation, recall that bydefinition<a class="displaced_anchor" name="eqtn36"></a>\begin{eqnarray}  \delta^L_j = \frac{\partial C}{\partial z^L_j}.\tag{36}\end{eqnarray}Applying the chain rule, we can re-express the partial derivativeabove in terms of partial derivatives with respect to the outputactivations,<a class="displaced_anchor" name="eqtn37"></a>\begin{eqnarray}  \delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j},\tag{37}\end{eqnarray}where the sum is over all neurons $k$ in the output layer.  Of course,the output activation $a^L_k$ of the $k^{\rm th}$ neuron depends onlyon the weighted input $z^L_j$ for the $j^{\rm th}$ neuron when $k =j$.  And so $\partial a^L_k / \partial z^L_j$ vanishes when $k \neqj$.  As a result we can simplify the previous equation to<a class="displaced_anchor" name="eqtn38"></a>\begin{eqnarray}  \delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j}.\tag{38}\end{eqnarray}Recalling that $a^L_j = \sigma(z^L_j)$ the second term on the rightcan be written as $\sigma'(z^L_j)$, and the equation becomes<a class="displaced_anchor" name="eqtn39"></a>\begin{eqnarray}  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j),\tag{39}\end{eqnarray}which is just <span id="margin_10696386908_reveal" class="equation_link">(BP1)</span><span id="margin_10696386908" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_10696386908_reveal').click(function() {$('#margin_10696386908').toggle('slow', function() {});});</script>, in component form.</p><p>Next, we'll prove <span id="margin_804618613400_reveal" class="equation_link">(BP2)</span><span id="margin_804618613400" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_804618613400_reveal').click(function() {$('#margin_804618613400').toggle('slow', function() {});});</script>, which gives an equation for the error$\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$.To do this, we want to rewrite $\delta^l_j = \partial C / \partialz^l_j$ in terms of $\delta^{l+1}_k = \partial C / \partial z^{l+1}_k$.We can do this using the chain rule,<a class="displaced_anchor" name="eqtn40"></a><a class="displaced_anchor" name="eqtn41"></a><a class="displaced_anchor" name="eqtn42"></a>\begin{eqnarray}  \delta^l_j & = & \frac{\partial C}{\partial z^l_j} \tag{40}\\  & = & \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} \tag{41}\\   & = & \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k,\tag{42}\end{eqnarray}where in the last line we have interchanged the two terms on theright-hand side, and substituted the definition of $\delta^{l+1}_k$.To evaluate the first term on the last line, note that<a class="displaced_anchor" name="eqtn43"></a>\begin{eqnarray}  z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j +b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) +b^{l+1}_k.\tag{43}\end{eqnarray}Differentiating, we obtain<a class="displaced_anchor" name="eqtn44"></a>\begin{eqnarray}  \frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j).\tag{44}\end{eqnarray}Substituting back into <span id="margin_793355739055_reveal" class="equation_link">(42)</span><span id="margin_793355739055" class="marginequation" style="display: none;"><a href="chap2.html#eqtn42" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   & = & \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k \nonumber\end{eqnarray}</a></span><script>$('#margin_793355739055_reveal').click(function() {$('#margin_793355739055').toggle('slow', function() {});});</script> we obtain<a class="displaced_anchor" name="eqtn45"></a>\begin{eqnarray}  \delta^l_j = \sum_k w^{l+1}_{kj}  \delta^{l+1}_k \sigma'(z^l_j).\tag{45}\end{eqnarray}This is just <span id="margin_364636395746_reveal" class="equation_link">(BP2)</span><span id="margin_364636395746" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP2" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}</a></span><script>$('#margin_364636395746_reveal').click(function() {$('#margin_364636395746').toggle('slow', function() {});});</script> written in component form.</p><p>The final two equations we want to prove are <span id="margin_177998023405_reveal" class="equation_link">(BP3)</span><span id="margin_177998023405" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP3" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial C}{\partial b^l_j} =  \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_177998023405_reveal').click(function() {$('#margin_177998023405').toggle('slow', function() {});});</script>and <span id="margin_621206336041_reveal" class="equation_link">(BP4)</span><span id="margin_621206336041" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}    \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_621206336041_reveal').click(function() {$('#margin_621206336041').toggle('slow', function() {});});</script>.  These also follow from the chain rule, in a mannersimilar to the proofs of the two equations above.  I leave them to youas an exercise. </p><p><h4><a name="exercise_522523"></a><a href="chap2.html#exercise_522523">Exercise</a></h4><ul><li> Prove Equations <span id="margin_962925629832_reveal" class="equation_link">(BP3)</span><span id="margin_962925629832" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP3" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial C}{\partial b^l_j} =  \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_962925629832_reveal').click(function() {$('#margin_962925629832').toggle('slow', function() {});});</script> and <span id="margin_245358521025_reveal" class="equation_link">(BP4)</span><span id="margin_245358521025" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}    \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_245358521025_reveal').click(function() {$('#margin_245358521025').toggle('slow', function() {});});</script>.</ul></p><p>That completes the proof of the four fundamental equations ofbackpropagation.  The proof may seem complicated.  But it's reallyjust the outcome of carefully applying the chain rule.  A little lesssuccinctly, we can think of backpropagation as a way of computing thegradient of the cost function by systematically applying the chainrule from multi-variable calculus.  That's all there really is tobackpropagation - the rest is details.</p><p><h3><a name="the_backpropagation_algorithm"></a><a href="chap2.html#the_backpropagation_algorithm">The backpropagation algorithm</a></h3></p><p>The backpropagation equations provide us with a way of computing thegradient of the cost function.  Let's explicitly write this out in theform of an algorithm:<ol><li> <strong>Input $x$:</strong> Set the corresponding activation $a^{1}$ for  the input layer.  </p><p><li> <strong>Feedforward:</strong> For each $l = 2, 3, \ldots, L$ compute  $z^{l} = w^l a^{l-1}+b^l$ and $a^{l} = \sigma(z^{l})$.</p><p><li> <strong>Output error $\delta^L$:</strong> Compute the vector $\delta^{L}  = \nabla_a C \odot \sigma'(z^L)$.</p><p><li> <strong>Backpropagate the error:</strong> For each $l = L-1, L-2,  \ldots, 2$ compute $\delta^{l} = ((w^{l+1})^T \delta^{l+1}) \odot  \sigma'(z^{l})$.</p><p><li> <strong>Output:</strong> The gradient of the cost function is given by  $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and  $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.</ol></p><p>Examining the algorithm you can see why it's called<em>back</em>propagation.  We compute the error vectors $\delta^l$backward, starting from the final layer.  It may seem peculiar thatwe're going through the network backward.  But if you think about theproof of backpropagation, the backward movement is a consequence ofthe fact that the cost is a function of outputs from the network.  Tounderstand how the cost varies with earlier weights and biases we needto repeatedly apply the chain rule, working backward through thelayers to obtain usable expressions.</p><p><h4><a name="exercises_675621"></a><a href="chap2.html#exercises_675621">Exercises</a></h4><ul><li><strong>Backpropagation with a single modified neuron</strong> Suppose we modify  a single neuron in a feedforward network so that the output from the  neuron is given by $f(\sum_j w_j x_j + b)$, where $f$ is some  function other than the sigmoid.  How should we modify the  backpropagation algorithm in this case?</p><p><li><strong>Backpropagation with linear neurons</strong> Suppose we replace the  usual non-linear $\sigma$ function with $\sigma(z) = z$ throughout  the network.  Rewrite the backpropagation algorithm for this case.</ul></p><p>As I've described it above, the backpropagation algorithm computes thegradient of the cost function for a single training example, $C =C_x$.  In practice, it's common to combine backpropagation with alearning algorithm such as stochastic gradient descent, in which wecompute the gradient for many training examples.  In particular, givena mini-batch of $m$ training examples, the following algorithm appliesa gradient descent learning step based on that mini-batch:<ol><li> <strong>Input a set of training examples</strong></p><p><li> <strong>For each training example $x$:</strong> Set the corresponding  input activation $a^{x,1}$, and perform the following steps:</p><p><ul><li> <strong>Feedforward:</strong> For each $l = 2, 3, \ldots, L$ compute  $z^{x,l} = w^l a^{x,l-1}+b^l$ and $a^{x,l} = \sigma(z^{x,l})$.</p><p><li> <strong>Output error $\delta^{x,L}$:</strong> Compute the vector  $\delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L})$.</p><p><li> <strong>Backpropagate the error:</strong> For each $l = L-1, L-2,  \ldots, 2$ compute $\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1})  \odot \sigma'(z^{x,l})$.</ul></p><p><li> <strong>Gradient descent:</strong> For each $l = L, L-1, \ldots, 2$  update the weights according to the rule $w^l \rightarrow  w^l-\frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$, and the  biases according to the rule $b^l \rightarrow b^l-\frac{\eta}{m}  \sum_x \delta^{x,l}$.</p><p></ol>Of course, to implement stochastic gradient descent in practice youalso need an outer loop generating mini-batches of training examples,and an outer loop stepping through multiple epochs of training.  I'veomitted those for simplicity.  </p><p></p><p><h3><a name="the_code_for_backpropagation"></a><a href="chap2.html#the_code_for_backpropagation">The code for backpropagation</a></h3></p><p>Having understood backpropagation in the abstract, we can nowunderstand the code used in the last chapter to implementbackpropagation.  Recall from<a href="chap1.html#implementing_our_network_to_classify_digits">that  chapter</a> that the code was contained in the <tt>update_mini_batch</tt>and <tt>backprop</tt> methods of the <tt>Network</tt> class.  The code forthese methods is a direct translation of the algorithm describedabove.  In particular, the <tt>update_mini_batch</tt> method updates the<tt>Network</tt>'s weights and biases by computing the gradient for thecurrent <tt>mini_batch</tt> of training examples:<div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">Network</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="o">...</span>
    <span class="k">def</span> <span class="nf">update_mini_batch</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">mini_batch</span><span class="p">,</span> <span class="n">eta</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Update the network&#39;s weights and biases by applying</span>
<span class="sd">        gradient descent using backpropagation to a single mini batch.</span>
<span class="sd">        The &quot;mini_batch&quot; is a list of tuples &quot;(x, y)&quot;, and &quot;eta&quot;</span>
<span class="sd">        is the learning rate.&quot;&quot;&quot;</span>
        <span class="n">nabla_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">b</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">]</span>
        <span class="n">nabla_w</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">]</span>
        <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">mini_batch</span><span class="p">:</span>
            <span class="n">delta_nabla_b</span><span class="p">,</span> <span class="n">delta_nabla_w</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">backprop</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
            <span class="n">nabla_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">nb</span><span class="o">+</span><span class="n">dnb</span> <span class="k">for</span> <span class="n">nb</span><span class="p">,</span> <span class="n">dnb</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">nabla_b</span><span class="p">,</span> <span class="n">delta_nabla_b</span><span class="p">)]</span>
            <span class="n">nabla_w</span> <span class="o">=</span> <span class="p">[</span><span class="n">nw</span><span class="o">+</span><span class="n">dnw</span> <span class="k">for</span> <span class="n">nw</span><span class="p">,</span> <span class="n">dnw</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">nabla_w</span><span class="p">,</span> <span class="n">delta_nabla_w</span><span class="p">)]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">weights</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span><span class="o">-</span><span class="p">(</span><span class="n">eta</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">mini_batch</span><span class="p">))</span><span class="o">*</span><span class="n">nw</span> 
                        <span class="k">for</span> <span class="n">w</span><span class="p">,</span> <span class="n">nw</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">,</span> <span class="n">nabla_w</span><span class="p">)]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">biases</span> <span class="o">=</span> <span class="p">[</span><span class="n">b</span><span class="o">-</span><span class="p">(</span><span class="n">eta</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">mini_batch</span><span class="p">))</span><span class="o">*</span><span class="n">nb</span> 
                       <span class="k">for</span> <span class="n">b</span><span class="p">,</span> <span class="n">nb</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">,</span> <span class="n">nabla_b</span><span class="p">)]</span>
</pre></div>
Most of the work is done by the line<tt>delta_nabla_b, delta_nabla_w = self.backprop(x, y)</tt> which usesthe <tt>backprop</tt> method to figure out the partial derivatives$\partial C_x / \partial b^l_j$ and $\partial C_x / \partialw^l_{jk}$.  The <tt>backprop</tt> method follows the algorithm in thelast section closely.  There is one small change - we use a slightlydifferent approach to indexing the layers.  This change is made totake advantage of a feature of Python, namely the use of negative listindices to count backward from the end of a list, so, e.g.,<tt>l[-3]</tt> is the third last entry in a list <tt>l</tt>.  The code for<tt>backprop</tt> is below, together with a few helper functions, whichare used to compute the $\sigma$ function, the derivative $\sigma'$,and the derivative of the cost function.  With these inclusions youshould be able to understand the code in a self-contained way.  Ifsomething's tripping you up, you may find it helpful to consult<a href="chap1.html#implementing_our_network_to_classify_digits">the  original description (and complete listing) of the code</a>.<div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">Network</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="o">...</span>
   <span class="k">def</span> <span class="nf">backprop</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Return a tuple &quot;(nabla_b, nabla_w)&quot; representing the</span>
<span class="sd">        gradient for the cost function C_x.  &quot;nabla_b&quot; and</span>
<span class="sd">        &quot;nabla_w&quot; are layer-by-layer lists of numpy arrays, similar</span>
<span class="sd">        to &quot;self.biases&quot; and &quot;self.weights&quot;.&quot;&quot;&quot;</span>
        <span class="n">nabla_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">b</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">]</span>
        <span class="n">nabla_w</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">]</span>
        <span class="c1"># feedforward</span>
        <span class="n">activation</span> <span class="o">=</span> <span class="n">x</span>
        <span class="n">activations</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="c1"># list to store all the activations, layer by layer</span>
        <span class="n">zs</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1"># list to store all the z vectors, layer by layer</span>
        <span class="k">for</span> <span class="n">b</span><span class="p">,</span> <span class="n">w</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">):</span>
            <span class="n">z</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">activation</span><span class="p">)</span><span class="o">+</span><span class="n">b</span>
            <span class="n">zs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
            <span class="n">activation</span> <span class="o">=</span> <span class="n">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
            <span class="n">activations</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">activation</span><span class="p">)</span>
        <span class="c1"># backward pass</span>
        <span class="n">delta</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost_derivative</span><span class="p">(</span><span class="n">activations</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">y</span><span class="p">)</span> <span class="o">*</span> \
            <span class="n">sigmoid_prime</span><span class="p">(</span><span class="n">zs</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
        <span class="n">nabla_b</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">delta</span>
        <span class="n">nabla_w</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span> <span class="n">activations</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="n">transpose</span><span class="p">())</span>
        <span class="c1"># Note that the variable l in the loop below is used a little</span>
        <span class="c1"># differently to the notation in Chapter 2 of the book.  Here,</span>
        <span class="c1"># l = 1 means the last layer of neurons, l = 2 is the</span>
        <span class="c1"># second-last layer, and so on.  It&#39;s a renumbering of the</span>
        <span class="c1"># scheme in the book, used here to take advantage of the fact</span>
        <span class="c1"># that Python can use negative indices in lists.</span>
        <span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">num_layers</span><span class="p">):</span>
            <span class="n">z</span> <span class="o">=</span> <span class="n">zs</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="p">]</span>
            <span class="n">sp</span> <span class="o">=</span> <span class="n">sigmoid_prime</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
            <span class="n">delta</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">transpose</span><span class="p">(),</span> <span class="n">delta</span><span class="p">)</span> <span class="o">*</span> <span class="n">sp</span>
            <span class="n">nabla_b</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="p">]</span> <span class="o">=</span> <span class="n">delta</span>
            <span class="n">nabla_w</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span> <span class="n">activations</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">transpose</span><span class="p">())</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">nabla_b</span><span class="p">,</span> <span class="n">nabla_w</span><span class="p">)</span>

<span class="o">...</span>

    <span class="k">def</span> <span class="nf">cost_derivative</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">output_activations</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Return the vector of partial derivatives \partial C_x /</span>
<span class="sd">        \partial a for the output activations.&quot;&quot;&quot;</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">output_activations</span><span class="o">-</span><span class="n">y</span><span class="p">)</span> 

<span class="k">def</span> <span class="nf">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;The sigmoid function.&quot;&quot;&quot;</span>
    <span class="k">return</span> <span class="mf">1.0</span><span class="o">/</span><span class="p">(</span><span class="mf">1.0</span><span class="o">+</span><span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">z</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">sigmoid_prime</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;Derivative of the sigmoid function.&quot;&quot;&quot;</span>
    <span class="k">return</span> <span class="n">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">))</span>
</pre></div>
</p><p><a id="backprop_over_minibatch"></a><h4><a name="problem_269962"></a><a href="chap2.html#problem_269962">Problem</a></h4><ul><li><strong>Fully matrix-based approach to backpropagation over a  mini-batch</strong> Our implementation of stochastic gradient descent loops  over training examples in a mini-batch.  It's possible to modify the  backpropagation algorithm so that it computes the gradients for all  training examples in a mini-batch simultaneously.  The idea is that  instead of beginning with a single input vector, $x$, we can begin  with a matrix $X = [x_1 x_2 \ldots x_m]$ whose columns are the  vectors in the mini-batch.  We forward-propagate by multiplying by  the weight matrices, adding a suitable matrix for the bias terms,  and applying the sigmoid function everywhere. We backpropagate along  similar lines.  Explicitly write out pseudocode for this approach to  the backpropagation algorithm.  Modify <tt>network.py</tt> so that it  uses this fully matrix-based approach.  The advantage of this  approach is that it takes full advantage of modern libraries for  linear algebra.  As a result it can be quite a bit faster than  looping over the mini-batch.  (On my laptop, for example, the  speedup is about a factor of two when run on MNIST classification  problems like those we considered in the last chapter.)  In  practice, all serious libraries for backpropagation use this fully  matrix-based approach or some variant.</ul></p><p><h3><a name="in_what_sense_is_backpropagation_a_fast_algorithm"></a><a href="chap2.html#in_what_sense_is_backpropagation_a_fast_algorithm">In what sense is backpropagation a fast algorithm?</a></h3></p><p>In what sense is backpropagation a fast algorithm?  To answer thisquestion, let's consider another approach to computing the gradient.Imagine it's the early days of neural networks research.  Maybe it'sthe 1950s or 1960s, and you're the first person in the world to thinkof using gradient descent to learn!  But to make the idea work youneed a way of computing the gradient of the cost function.  You thinkback to your knowledge of calculus, and decide to see if you can usethe chain rule to compute the gradient.  But after playing around abit, the algebra looks complicated, and you get discouraged.  So youtry to find another approach.  You decide to regard the cost as afunction of the weights $C = C(w)$ alone (we'll get back to the biasesin a moment).  You number the weights $w_1, w_2, \ldots$, and want tocompute $\partial C / \partial w_j$ for some particular weight $w_j$.An obvious way of doing that is to use the approximation<a class="displaced_anchor" name="eqtn46"></a>\begin{eqnarray}  \frac{\partial    C}{\partial w_{j}} \approx \frac{C(w+\epsilon    e_j)-C(w)}{\epsilon},\tag{46}\end{eqnarray}where $\epsilon > 0$ is a small positive number, and $e_j$ is the unitvector in the $j^{\rm th}$ direction.  In other words, we can estimate$\partial C / \partial w_j$ by computing the cost $C$ for two slightlydifferent values of $w_j$, and then applyingEquation <span id="margin_150243494261_reveal" class="equation_link">(46)</span><span id="margin_150243494261" class="marginequation" style="display: none;"><a href="chap2.html#eqtn46" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial    C}{\partial w_{j}} \approx \frac{C(w+\epsilon    e_j)-C(w)}{\epsilon} \nonumber\end{eqnarray}</a></span><script>$('#margin_150243494261_reveal').click(function() {$('#margin_150243494261').toggle('slow', function() {});});</script>.  The same idea will let uscompute the partial derivatives $\partial C / \partial b$ with respectto the biases.</p><p>This approach looks very promising.  It's simple conceptually, andextremely easy to implement, using just a few lines of code.Certainly, it looks much more promising than the idea of using thechain rule to compute the gradient!</p><p>Unfortunately, while this approach appears promising, when youimplement the code it turns out to be extremely slow.  To understandwhy, imagine we have a million weights in our network.  Then for eachdistinct weight $w_j$ we need to compute $C(w+\epsilon e_j)$ in orderto compute $\partial C / \partial w_j$.  That means that to computethe gradient we need to compute the cost function a million differenttimes, requiring a million forward passes through the network (pertraining example).  We need to compute $C(w)$ as well, so that's atotal of a million and one passes through the network.</p><p>What's clever about backpropagation is that it enables us tosimultaneously compute <em>all</em> the partial derivatives $\partial C/ \partial w_j$ using just one forward pass through the network,followed by one backward pass through the network.  Roughly speaking,the computational cost of the backward pass is about the same as theforward pass*<span class="marginnote">*This should be plausible, but it requires some  analysis to make a careful statement.  It's plausible because the  dominant computational cost in the forward pass is multiplying by  the weight matrices, while in the backward pass it's multiplying by  the transposes of the weight matrices.  These operations obviously  have similar computational cost.</span>.  And so the total cost ofbackpropagation is roughly the same as making just two forward passesthrough the network.  Compare that to the million and one forwardpasses we needed for the approach basedon <span id="margin_294942093813_reveal" class="equation_link">(46)</span><span id="margin_294942093813" class="marginequation" style="display: none;"><a href="chap2.html#eqtn46" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial    C}{\partial w_{j}} \approx \frac{C(w+\epsilon    e_j)-C(w)}{\epsilon} \nonumber\end{eqnarray}</a></span><script>$('#margin_294942093813_reveal').click(function() {$('#margin_294942093813').toggle('slow', function() {});});</script>!  And so even though backpropagationappears superficially more complex than the approach basedon <span id="margin_302280124955_reveal" class="equation_link">(46)</span><span id="margin_302280124955" class="marginequation" style="display: none;"><a href="chap2.html#eqtn46" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial    C}{\partial w_{j}} \approx \frac{C(w+\epsilon    e_j)-C(w)}{\epsilon} \nonumber\end{eqnarray}</a></span><script>$('#margin_302280124955_reveal').click(function() {$('#margin_302280124955').toggle('slow', function() {});});</script>, it's actually much, much faster.</p><p>This speedup was first fully appreciated in 1986, and it greatlyexpanded the range of problems that neural networks could solve.That, in turn, caused a rush of people using neural networks.  Ofcourse, backpropagation is not a panacea.  Even in the late 1980speople ran up against limits, especially when attempting to usebackpropagation to train deep neural networks, i.e., networks withmany hidden layers.  Later in the book we'll see how modern computersand some clever new ideas now make it possible to use backpropagationto train such deep neural networks.</p><p><h3><a name="backpropagation_the_big_picture"></a><a href="chap2.html#backpropagation_the_big_picture">Backpropagation: the big picture</a></h3></p><p>As I've explained it, backpropagation presents two mysteries.  First,what's the algorithm really doing?  We've developed a picture of theerror being backpropagated from the output.  But can we go any deeper,and build up more intuition about what is going on when we do allthese matrix and vector multiplications?  The second mystery is howsomeone could ever have discovered backpropagation in the first place?It's one thing to follow the steps in an algorithm, or even to followthe proof that the algorithm works.  But that doesn't mean youunderstand the problem so well that you could have discovered thealgorithm in the first place.  Is there a plausible line of reasoningthat could have led you to discover the backpropagation algorithm?  Inthis section I'll address both these mysteries.</p><p>To improve our intuition about what the algorithm is doing, let'simagine that we've made a small change $\Delta w^l_{jk}$ to someweight in the network, $w^l_{jk}$:<center><img src="images/tikz22.png"/></center>That change in weight will cause a change in the output activationfrom the corresponding neuron:<center><img src="images/tikz23.png"/></center>That, in turn, will cause a change in <em>all</em> the activations inthe next layer:<center><img src="images/tikz24.png"/></center>Those changes will in turn cause changes in the next layer, and thenthe next, and so on all the way through to causing a change in thefinal layer, and then in the cost function:<center><img src="images/tikz25.png"/></center>The change $\Delta C$ in the cost is related to the change $\Deltaw^l_{jk}$ in the weight by the equation<a class="displaced_anchor" name="eqtn47"></a>\begin{eqnarray}   \Delta C \approx \frac{\partial C}{\partial w^l_{jk}} \Delta w^l_{jk}.\tag{47}\end{eqnarray}This suggests that a possible approach to computing $\frac{\partial  C}{\partial w^l_{jk}}$ is to carefully track how a small change in$w^l_{jk}$ propagates to cause a small change in $C$.  If we can dothat, being careful to express everything along the way in terms ofeasily computable quantities, then we should be able to compute$\partial C / \partial w^l_{jk}$.</p><p>Let's try to carry this out.  The change $\Delta w^l_{jk}$ causes asmall change $\Delta a^{l}_j$ in the activation of the $j^{\rm th}$ neuron inthe $l^{\rm th}$ layer.  This change is given by<a class="displaced_anchor" name="eqtn48"></a>\begin{eqnarray}   \Delta a^l_j \approx \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}.\tag{48}\end{eqnarray}The change in activation $\Delta a^l_{j}$ will cause changes in<em>all</em> the activations in the next layer, i.e., the $(l+1)^{\rm  th}$ layer.  We'll concentrate on the way just a single one of thoseactivations is affected, say $a^{l+1}_q$,<center><img src="images/tikz26.png"/></center>In fact, it'll cause the following change:<a class="displaced_anchor" name="eqtn49"></a>\begin{eqnarray}  \Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \Delta a^l_j.\tag{49}\end{eqnarray}Substituting in the expression from Equation <span id="margin_997832314879_reveal" class="equation_link">(48)</span><span id="margin_997832314879" class="marginequation" style="display: none;"><a href="chap2.html#eqtn48" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \Delta a^l_j \approx \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk} \nonumber\end{eqnarray}</a></span><script>$('#margin_997832314879_reveal').click(function() {$('#margin_997832314879').toggle('slow', function() {});});</script>,we get:<a class="displaced_anchor" name="eqtn50"></a>\begin{eqnarray}  \Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}.\tag{50}\end{eqnarray}Of course, the change $\Delta a^{l+1}_q$ will, in turn, cause changesin the activations in the next layer.  In fact, we can imagine a pathall the way through the network from $w^l_{jk}$ to $C$, with eachchange in activation causing a change in the next activation, and,finally, a change in the cost at the output.  If the path goes throughactivations $a^l_j, a^{l+1}_q, \ldots, a^{L-1}_n, a^L_m$ then theresulting expression is<a class="displaced_anchor" name="eqtn51"></a>\begin{eqnarray}  \Delta C \approx \frac{\partial C}{\partial a^L_m}   \frac{\partial a^L_m}{\partial a^{L-1}_n}  \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots  \frac{\partial a^{l+1}_q}{\partial a^l_j}  \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk},\tag{51}\end{eqnarray}that is, we've picked up a $\partial a / \partial a$ type term foreach additional neuron we've passed through, as well as the $\partialC/\partial a^L_m$ term at the end.  This represents the change in $C$due to changes in the activations along this particular path throughthe network.  Of course, there's many paths by which a change in$w^l_{jk}$ can propagate to affect the cost, and we've beenconsidering just a single path. To compute the total change in $C$ itis plausible that we should sum over all the possible paths betweenthe weight and the final cost, i.e.,<a class="displaced_anchor" name="eqtn52"></a>\begin{eqnarray}   \Delta C \approx \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}   \frac{\partial a^L_m}{\partial a^{L-1}_n}  \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots  \frac{\partial a^{l+1}_q}{\partial a^l_j}   \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk},\tag{52}\end{eqnarray}where we've summed over all possible choices for the intermediateneurons along the path.  Comparing with <span id="margin_428032444217_reveal" class="equation_link">(47)</span><span id="margin_428032444217" class="marginequation" style="display: none;"><a href="chap2.html#eqtn47" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \Delta C \approx \frac{\partial C}{\partial w^l_{jk}} \Delta w^l_{jk} \nonumber\end{eqnarray}</a></span><script>$('#margin_428032444217_reveal').click(function() {$('#margin_428032444217').toggle('slow', function() {});});</script> wesee that<a class="displaced_anchor" name="eqtn53"></a>\begin{eqnarray}   \frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}   \frac{\partial a^L_m}{\partial a^{L-1}_n}  \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots  \frac{\partial a^{l+1}_q}{\partial a^l_j}   \frac{\partial a^l_j}{\partial w^l_{jk}}.\tag{53}\end{eqnarray}Now, Equation <span id="margin_315505332617_reveal" class="equation_link">(53)</span><span id="margin_315505332617" class="marginequation" style="display: none;"><a href="chap2.html#eqtn53" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}   \frac{\partial a^L_m}{\partial a^{L-1}_n}  \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots  \frac{\partial a^{l+1}_q}{\partial a^l_j}   \frac{\partial a^l_j}{\partial w^l_{jk}} \nonumber\end{eqnarray}</a></span><script>$('#margin_315505332617_reveal').click(function() {$('#margin_315505332617').toggle('slow', function() {});});</script> looks complicated.  However,it has a nice intuitive interpretation.  We're computing the rate ofchange of $C$ with respect to a weight in the network.  What theequation tells us is that every edge between two neurons in thenetwork is associated with a rate factor which is just the partialderivative of one neuron's activation with respect to the otherneuron's activation.  The edge from the first weight to the firstneuron has a rate factor $\partial a^{l}_j / \partial w^l_{jk}$.  Therate factor for a path is just the product of the rate factors alongthe path.  And the total rate of change $\partial C / \partialw^l_{jk}$ is just the sum of the rate factors of all paths from theinitial weight to the final cost.  This procedure is illustrated here,for a single path:<center><img src="images/tikz27.png"/></center></p><p>What I've been providing up to now is a heuristic argument, a way ofthinking about what's going on when you perturb a weight in a network.Let me sketch out a line of thinking you could use to further developthis argument.  First, you could derive explicit expressions for allthe individual partial derivatives inEquation <span id="margin_269499726515_reveal" class="equation_link">(53)</span><span id="margin_269499726515" class="marginequation" style="display: none;"><a href="chap2.html#eqtn53" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}   \frac{\partial a^L_m}{\partial a^{L-1}_n}  \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots  \frac{\partial a^{l+1}_q}{\partial a^l_j}   \frac{\partial a^l_j}{\partial w^l_{jk}} \nonumber\end{eqnarray}</a></span><script>$('#margin_269499726515_reveal').click(function() {$('#margin_269499726515').toggle('slow', function() {});});</script>.  That's easy to do with a bit ofcalculus.  Having done that, you could then try to figure out how towrite all the sums over indices as matrix multiplications.  This turnsout to be tedious, and requires some persistence, but notextraordinary insight.  After doing all this, and then simplifying asmuch as possible, what you discover is that you end up with exactlythe backpropagation algorithm!  And so you can think of thebackpropagation algorithm as providing a way of computing the sum overthe rate factor for all these paths.  Or, to put it slightlydifferently, the backpropagation algorithm is a clever way of keepingtrack of small perturbations to the weights (and biases) as theypropagate through the network, reach the output, and then affect thecost.</p><p>Now, I'm not going to work through all this here.  It's messy andrequires considerable care to work through all the details.  If you'reup for a challenge, you may enjoy attempting it.  And even if not, Ihope this line of thinking gives you some insight into whatbackpropagation is accomplishing.</p><p>What about the other mystery - how backpropagation could have beendiscovered in the first place?  In fact, if you follow the approach Ijust sketched you will discover a proof of backpropagation.Unfortunately, the proof is quite a bit longer and more complicatedthan the one I described earlier in this chapter.  So how was thatshort (but more mysterious) proof discovered?  What you find when youwrite out all the details of the long proof is that, after the fact,there are several obvious simplifications staring you in the face.You make those simplifications, get a shorter proof, and write thatout.  And then several more obvious simplifications jump out atyou. So you repeat again.  The result after a few iterations is theproof we saw earlier*<span class="marginnote">*There is one clever step required.  In  Equation <span id="margin_261654525149_reveal" class="equation_link">(53)</span><span id="margin_261654525149" class="marginequation" style="display: none;"><a href="chap2.html#eqtn53" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}   \frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}   \frac{\partial a^L_m}{\partial a^{L-1}_n}  \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots  \frac{\partial a^{l+1}_q}{\partial a^l_j}   \frac{\partial a^l_j}{\partial w^l_{jk}} \nonumber\end{eqnarray}</a></span><script>$('#margin_261654525149_reveal').click(function() {$('#margin_261654525149').toggle('slow', function() {});});</script> the intermediate variables are  activations like $a_q^{l+1}$.  The clever idea is to switch to using  weighted inputs, like $z^{l+1}_q$, as the intermediate variables. If  you don't have this idea, and instead continue using the activations  $a^{l+1}_q$, the proof you obtain turns out to be slightly more  complex than the proof given earlier in the chapter.</span> - short, butsomewhat obscure, because all the signposts to its construction havebeen removed!  I am, of course, asking you to trust me on this, butthere really is no great mystery to the origin of the earlier proof.It's just a lot of hard work simplifying the proof I've sketched inthis section.</p><p><br/><br/><br/></p><p></div><div class="footer"> <span class="left_footer"> In academic work,
please cite this book as: Michael A. Nielsen, "Neural Networks and
Deep Learning", Determination Press, 2015

<br/>
<br/>

This work is licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_GB"
style="color: #eee;">Creative Commons Attribution-NonCommercial 3.0
Unported License</a>.  This means you're free to copy, share, and
build on this book, but not to sell it.  If you're interested in
commercial use, please <a
href="mailto:mn@michaelnielsen.org">contact me</a>.
</span>
<span class="right_footer">
Last update: Thu Dec 26 15:26:33 2019
<br/>
<br/>
<br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_GB"><img alt="Creative Commons Licence" style="border-width:0" src="http://i.creativecommons.org/l/by-nc/3.0/88x31.png" /></a>
</span>
</div>
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-44208967-1', 'neuralnetworksanddeeplearning.com');
  ga('send', 'pageview');

</script>
</body>
</html>